Readability Classification for German using Lexical, Syntactic, and Morphological Features

Authors

  • Julia Hancke
  • Sowmya Vajjala
  • Walt Detmar Meurers
Abstract

We investigate the problem of reading level assessment for German texts on a newly compiled corpus of freely available easy and difficult articles, targeted at child and adult readers respectively. We adapt a wide range of syntactic, lexical and language model features from previous research on English and combine them with new features that make use of the rich morphology of German. We show that readability classification for German based on these features is highly successful, reaching 89.7% accuracy, with the new morphological features making an important contribution.

Title and abstract in German

Lesbarkeitsklassifizierung für das Deutsche mit lexikalischen, syntaktischen und morphologischen Merkmalen

Wir untersuchen das Problem der Lesbarkeitsklassifizierung für deutsche Texte anhand eines neuen Korpus frei zugänglicher Artikel, die einerseits Erwachsene und andererseits Kinder als Zielgruppe haben. Wir adaptieren eine Vielzahl syntaktischer, lexikalischer und language model Merkmale aus der englischen Lesbarkeitsforschung und kombinieren sie mit neuen Merkmalen, die sich die ausgeprägte Morphologie des Deutschen zu Nutze machen. Wir zeigen, dass diese Merkmale sehr erfolgreich dazu eingesetzt werden können, deutsche Texte nach ihrer Lesbarkeit zu klassifizieren. In unseren Experimenten erreicht die Klassifikation eine Genauigkeit von 89,7%, wozu die neuen morphologischen Merkmale einen wichtigen Beitrag leisten.

Related resources

Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories

I investigate Russian second language readability assessment using a machine-learning approach with a range of lexical, morphological, syntactic, and discourse features. Testing the model with a new collection of Russian L2 readability corpora achieves an F-score of 0.671 and adjacent accuracy 0.919 on a 6-level classification task. Information gain and feature subset evaluation shows that morp...

A Semantically Oriented Readability Checker for German

One major reason that readability checkers are still far away from judging the understandability of texts consists in the fact that no semantic information is used. Syntactic, lexical, or morphological information can only give limited access for estimating the cognitive difficulties for a human being to comprehend a text. In this paper however, we present a readability checker which uses seman...

On Improving the Accuracy of Readability Classification using Insights from Second Language Acquisition

We investigate the problem of readability assessment using a range of lexical and syntactic features and study their impact on predicting the grade level of texts. As empirical basis, we combined two web-based text sources, Weekly Reader and BBC Bitesize, targeting different age groups, to cover a broad range of school grades. On the conceptual side, we explore the use of lexical and syntactic ...

Automatic Readability Classification of Crowd-Sourced Data based on Linguistic and Information-Theoretic Features

This paper presents a classifier of text readability based on information-theoretic features. The classifier was developed based on a linguistic approach to readability that explores lexical, syntactic and semantic features. For this evaluation we extracted a corpus of 645 articles from Wikipedia together with their quality judgments. We show that information-theoretic features perform as well ...

Simple or Complex? Assessing the readability of Basque Texts

In this paper we present a readability assessment system for Basque, ErreXail, which is going to be the preprocessing module of a Text Simplification system. To that end we compile two corpora, one of simple texts and another one of complex texts. To analyse those texts, we implement global, lexical, morphological, morpho-syntactic, syntactic and pragmatic features based on other languages and ...

Publication date: 2012